Scheduling Hyperparameters to Improve Generalization: From Centralized SGD to Asynchronous SGD
Authors
Abstract
This paper studies how to schedule hyperparameters to improve the generalization of both centralized single-machine stochastic gradient descent (SGD) and distributed asynchronous SGD (ASGD). SGD augmented with momentum variants (e.g., heavy ball (SHB) and Nesterov's accelerated gradient (NAG)) has been the default optimizer for many tasks, in both centralized and distributed environments. However, advanced momentum variants, despite their empirical advantage over classical SHB/NAG, introduce extra hyperparameters to tune, and this error-prone tuning is a main barrier to AutoML. Centralized SGD: We first focus on centralized SGD and show how to efficiently schedule a large class of hyperparameters to improve generalization. We propose a unified framework called multistage quasi-hyperbolic momentum (Multistage QHM), which covers a family of momentum methods as its special cases (e.g., vanilla SGD/SHB/NAG). Existing works mainly study scheduling only the learning rate α's decay, while Multistage QHM additionally allows varying other hyperparameters (e.g., the momentum factor) and demonstrates better generalization than decaying α alone. We prove convergence for general nonconvex objectives. Distributed ASGD: We then extend our theory to asynchronous SGD (ASGD), where a parameter server distributes data batches to several worker machines and updates parameters by aggregating batch gradients from the workers. We quantify the asynchrony between different workers (i.e., staleness), model the dynamics of ASGD iterations with a stochastic differential equation (SDE), and derive a PAC-Bayesian generalization bound for ASGD. As a byproduct, we show that moderate staleness helps ASGD generalize better. Our scheduling strategies have rigorous justifications rather than relying on blind trial-and-error: we theoretically prove why they decrease the derived generalization error bounds in both cases. Empirically, they simplify the tuning process and beat competitive optimizers in test accuracy. Our codes are publicly available at https://github.com/jsycsjh/centralized-asynchronous-tuning.
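To make the centralized half of the abstract concrete, below is a minimal sketch of a quasi-hyperbolic momentum (QHM) step together with a hypothetical multistage schedule. The update rule follows the standard QHM formulation (momentum factor β, weighting ν, learning rate α); the names `qhm_step`/`train`, the schedule values, and the stage lengths are illustrative assumptions, not the paper's Multistage QHM recipe.

```python
import numpy as np

def qhm_step(theta, buf, grad, lr, beta, nu):
    """One quasi-hyperbolic momentum (QHM) update.

    Special cases (QHM unifies these, as the abstract notes):
      nu = 0    -> plain SGD
      nu = 1    -> heavy ball (SHB) with a normalized momentum buffer
      nu = beta -> (normalized) Nesterov's accelerated gradient (NAG)
    """
    buf = beta * buf + (1.0 - beta) * grad                # EMA of stochastic gradients
    theta = theta - lr * ((1.0 - nu) * grad + nu * buf)   # mix raw gradient and buffer
    return theta, buf

# Hypothetical multistage schedule: each stage fixes (lr, beta, nu) for a number of
# steps; unlike a plain learning-rate decay, beta and nu also change across stages.
stages = [
    {"steps": 5000, "lr": 0.1,   "beta": 0.9,  "nu": 0.7},
    {"steps": 5000, "lr": 0.01,  "beta": 0.95, "nu": 0.9},
    {"steps": 5000, "lr": 0.001, "beta": 0.99, "nu": 1.0},
]

def train(theta0, grad_fn, stages, seed=0):
    """Run the staged QHM loop; grad_fn(theta, rng) returns a stochastic gradient."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    buf = np.zeros_like(theta)
    for stage in stages:
        for _ in range(stage["steps"]):
            g = grad_fn(theta, rng)
            theta, buf = qhm_step(theta, buf, g, stage["lr"], stage["beta"], stage["nu"])
    return theta
```

As a quick usage example, `train(np.zeros(10), lambda th, rng: 2.0 * th + 0.1 * rng.standard_normal(th.shape), stages)` runs the three stages on a noisy quadratic objective.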
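For the distributed half, the following single-threaded toy loop illustrates what staleness means operationally: the gradient applied at a given step was computed on a parameter snapshot from τ server updates earlier. The function `asgd_simulate`, the uniform-delay model, and `max_delay` are assumptions for illustration only; they are not the paper's SDE model of ASGD or its staleness distribution.

```python
import collections

import numpy as np

def asgd_simulate(theta0, grad_fn, num_steps, max_delay, lr=0.01, seed=0):
    """Toy parameter-server ASGD: each applied gradient was computed on
    parameters that are tau server updates old (tau = staleness)."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    history = collections.deque([theta.copy()], maxlen=max_delay + 1)  # recent snapshots
    for _ in range(num_steps):
        tau = int(rng.integers(0, len(history)))   # random staleness, at most max_delay
        stale_theta = history[-(tau + 1)]          # snapshot from tau updates ago
        g = grad_fn(stale_theta, rng)              # worker gradient on stale parameters
        theta = theta - lr * g                     # server applies the (stale) gradient
        history.append(theta.copy())
    return theta
```

Sweeping `max_delay` and comparing held-out error of the resulting iterates gives a quick, informal way to probe the abstract's claim that moderate staleness can help ASGD generalize better.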
Similar resources
Faster Asynchronous SGD
Asynchronous distributed stochastic gradient descent methods have trouble converging because of stale gradients. A gradient update sent to a parameter server by a client is stale if the parameters used to calculate that gradient have since been updated on the server. Approaches that quantify staleness in terms of the number of elapsed updates have been proposed to circumvent this problem. In th...
Improving Generalization Performance by Switching from Adam to SGD
Despite superior training outcomes, adaptive optimization methods such as Adam, Adagrad or RMSprop have been found to generalize poorly compared to Stochastic gradient descent (SGD). These methods tend to perform well in the initial portion of training but are outperformed by SGD at later stages of training. We investigate a hybrid strategy that begins training with an adaptive method and switc...
The Effects of Hyperparameters on SGD Training of Neural Networks
The performance of neural network classifiers is determined by a number of hyperparameters, including learning rate, batch size, and depth. A number of attempts have been made to explore these parameters in the literature, and at times, to develop methods for optimizing them. However, exploration of parameter spaces has often been limited. In this note, I report the results of large scale exper...
Theory of Deep Learning III: Generalization Properties of SGD
In Theory III we characterize with a mix of theory and experiments the consistency and generalization properties of deep convolutional networks trained with Stochastic Gradient Descent in classification tasks. A present perceived puzzle is that deep networks show good predictive performance when overparametrization relative to the number of training data suggests overfitting. We describe an exp...
Statistical inference using SGD
We present a novel method for frequentist statistical inference in M-estimation problems, based on stochastic gradient descent (SGD) with a fixed step size: we demonstrate that the average of such SGD sequences can be used for statistical inference, after proper scaling. An intuitive analysis using the Ornstein-Uhlenbeck process suggests that such averages are asymptotically normal. From a prac...
Journal
Journal title: ACM Transactions on Knowledge Discovery From Data
Year: 2022
ISSN: 1556-472X, 1556-4681
DOI: https://doi.org/10.1145/3544782